Sparse Markov net learning with priors on regularization parameters
Abstract
In this paper, we consider the problem of structure recovery in a Markov network over Gaussian variables, which is equivalent to finding the zero-pattern of the sparse inverse covariance matrix. Recently proposed l1-regularized optimization methods result in convex problems that can be solved optimally and efficiently. However, the accuracy of such methods can be quite sensitive to the choice of the regularization parameter, and optimal selection of this parameter remains an open problem. Herein, we adopt a Bayesian approach, treating the regularization parameter(s) as random variable(s) with some prior, and using MAP optimization to find both the inverse covariance matrix and the unknown regularization parameters. Our general formulation allows a vector of regularization parameters and is well-suited for learning structured graphs, such as scale-free networks, where the sparsity of nodes varies significantly. We present promising empirical results on both synthetic and real-life datasets, demonstrating that our approach achieves a better balance between false-positive and false-negative errors than commonly used approaches.

Introduction

In many applications of statistical learning the objective is not simply to construct an accurate predictive model, but rather to discover meaningful interactions among the variables. This is particularly important in biological applications such as, for example, reverse-engineering of gene regulatory networks, or reconstruction of brain-activation patterns from functional MRI (fMRI) data. Probabilistic graphical models, such as Markov networks (or Markov random fields), provide a principled way of modeling multivariate data distributions that is both predictive and interpretable. A standard approach to learning Markov network structure is to choose the simplest model, i.e. the sparsest network, that adequately explains the data. Formally, this leads to a regularized maximum-likelihood problem with a penalty on the number of parameters, or l0-norm, a generally intractable problem that was often solved approximately by greedy search (Heckerman 1995). Recently, better approximation methods were suggested (Meinshausen & Buhlmann 2006; Wainwright, Ravikumar, & Lafferty 2007; Yuan & Lin 2007; Banerjee, El Ghaoui, & d’Aspremont 2008; Friedman, Hastie, & Tibshirani 2007; Duchi, Gould, & Koller 2008) that exploit the sparsity-enforcing property of l1-norm regularization and yield convex optimization problems that can be solved efficiently.

However, these approaches are known to be sensitive to the choice of the regularization parameter, i.e. the weight on the l1-penalty, and to the best of our knowledge, selecting the optimal value of this parameter remains an open problem. Indeed, the two most commonly used approaches are (1) cross-validation and (2) theoretical derivations. However, the λ selected by cross-validation, i.e. the estimate of the prediction-oracle solution that maximizes the test-data likelihood (i.e. minimizes the predictive risk), is typically too small and yields a high false-positive rate. (This is not surprising, as it is well known that a cross-validated λ chosen for the prediction objective can be a poor choice for structure recovery/model selection in the l1-regularized setting; see (Meinshausen & Buhlmann 2006) for examples where the λ selected by cross-validation leads to provably inconsistent structure recovery.) On the other hand, the theoretically derived λ of (Banerjee, El Ghaoui, & d’Aspremont 2008) has an asymptotic guarantee of correct recovery of the connectivity components (rather than the edges), which correspond to marginal rather than conditional independencies, i.e. to the entries of the covariance rather than the inverse covariance matrix. Although such an approach is asymptotically consistent, for a finite number of samples it tends to miss many edges, resulting in high false-negative error rates.
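To make the cross-validation baseline above concrete, here is a minimal sketch (an illustration, not part of the paper) of selecting λ by cross-validated graphical lasso and comparing the recovered zero-pattern against a known ground truth; it assumes scikit-learn's GraphicalLassoCV, and the chain-structured precision matrix, sample size, and zero-threshold are arbitrary choices for the example.

```python
# Cross-validation baseline for selecting lambda, sketched with scikit-learn.
# The ground-truth precision matrix and all constants are illustrative only.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
p, n = 30, 200

# Ground truth: a tridiagonal (chain-structured) precision matrix; it is
# strictly diagonally dominant, hence positive definite, and its edge set is known.
C_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma_true = np.linalg.inv(C_true)

# Draw n zero-mean Gaussian samples with covariance Sigma_true.
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n)

# Cross-validated graphical lasso: picks the penalty that maximizes
# held-out log-likelihood, i.e. the prediction-oracle estimate of lambda.
model = GraphicalLassoCV(cv=5).fit(X)
C_hat = model.precision_

# Compare estimated and true zero-patterns on the off-diagonal entries.
off = ~np.eye(p, dtype=bool)
true_edges = np.abs(C_true[off]) > 0
est_edges = np.abs(C_hat[off]) > 1e-4   # small threshold for numerical zeros
fp = np.sum(est_edges & ~true_edges)
fn = np.sum(~est_edges & true_edges)
print(f"alpha = {model.alpha_:.4f}, false positives = {fp}, false negatives = {fn}")
```

In line with the discussion above, one would typically expect the cross-validated penalty to be on the small side, so that false-positive edges dominate false-negative ones in such experiments.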
In this paper, we propose a Bayesian approach to regularization parameter selection that also generalizes to the case of a vector-valued λ, allowing one to choose, if necessary, a different sparsity level for each node in the network. (This work extends our approach to scalar-λ selection proposed in (Asadi et al. 2009); for completeness, we also summarize the results from (Asadi et al. 2009) here.) More specifically, the regularization parameters controlling the sparsity of the solution are treated as random variables with particular priors, and the objective is to find a MAP solution (Θ, Λ), where Θ is the set of model parameters and Λ is the set of regularization parameters. Our algorithm is based on alternating optimization over Θ and Λ, respectively.

Note that our general formulation is well-suited for learning structured networks with potentially very different node degrees (and thus different sparsity of the columns in the inverse covariance matrix). A common practical example is networks with heavy-tailed (power-law) degree distributions, also called scale-free networks. Examples of such networks include social networks, protein-interaction networks, the Internet, the world wide web, correlation networks between active brain areas in fMRI studies (Eguiluz, Chialvo, Cecchi, Baliki, & Apkarian 2005), and many other real-life networks (see (Barabasi & Albert 1999) for a survey). Empirical results on both random and structured (power-law) networks demonstrate that our approach compares favorably to previous approaches, achieving a better balance between the false-positive and false-negative errors; simulating such structured networks is illustrated in the sketch below.
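The structured networks discussed above are straightforward to simulate for synthetic experiments. The following sketch (illustrative only, not the paper's data-generation protocol) assumes networkx is available, draws a Barabási–Albert power-law graph, and converts its adjacency pattern into a sparse positive-definite precision matrix whose row densities differ widely across nodes.

```python
# Illustrative sketch: build a scale-free (power-law) graph and turn its
# adjacency pattern into a sparse, positive-definite precision matrix with
# very different row densities. Assumes networkx; constants are arbitrary.
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
p = 100

# Barabasi-Albert preferential attachment: heavy-tailed degree distribution.
G = nx.barabasi_albert_graph(n=p, m=2, seed=1)
A = nx.to_numpy_array(G)

# Assign small random weights to the edges, keeping the matrix symmetric.
W = np.triu(A * rng.uniform(0.1, 0.3, size=(p, p)), k=1)
W = W + W.T

# Strict diagonal dominance makes the precision matrix positive definite.
C = -W + np.diag(np.abs(W).sum(axis=1) + 0.1)

degrees = A.sum(axis=1).astype(int)
print("min/median/max node degree:", degrees.min(), int(np.median(degrees)), degrees.max())
print("precision matrix is positive definite:", bool(np.all(np.linalg.eigvalsh(C) > 0)))
```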
Our Approach

Let X = {X1, ..., Xp} be a set of p random variables, and let G = (V, E) be a Markov network (a Markov random field, or MRF) representing the conditional independence structure of the joint distribution P(X). The set of vertices V = {1, ..., p} is in one-to-one correspondence with the set of variables in X. The edge set E contains an edge (i, j) if and only if Xi is conditionally dependent on Xj given all remaining variables; the lack of an edge between Xi and Xj means that the two variables are conditionally independent given all remaining variables (Lauritzen 1996). We assume a multivariate Gaussian probability density function over X = {X1, ..., Xp}:

$$p(x) = (2\pi)^{-p/2} \det(\Sigma)^{-\frac{1}{2}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)} \qquad (1)$$

where μ is the mean and Σ is the covariance matrix of the distribution, and $x^T$ denotes the transpose of the column vector x. Without loss of generality, we assume that the data are normalized to have zero mean (μ = 0), so we only need to estimate the parameter Σ (or Σ^{-1}). Since $\det(\Sigma)^{-1} = \det(\Sigma^{-1})$, we can rewrite eq. (1), with C = Σ^{-1} and μ = 0, as

$$p(x) = (2\pi)^{-p/2} \det(C)^{\frac{1}{2}} e^{-\frac{1}{2} x^T C x}. \qquad (2)$$

Missing edges in the above graphical model correspond to zero entries in the inverse covariance matrix C = Σ^{-1}, and thus the problem of structure learning for this probabilistic graphical model is equivalent to learning the zero-pattern of the inverse covariance matrix. Note that the inverse of the maximum-likelihood estimate of the covariance matrix Σ (i.e. of the empirical covariance matrix $A = \frac{1}{n}\sum_{i=1}^n x_i^T x_i$, where $x_i$ is the i-th sample, i = 1, ..., n), even if it exists, does not typically contain any elements that are exactly zero. Therefore an explicit sparsity-enforcing constraint needs to be added to the estimation process. A common approach is to include as a penalty the l1-norm of C, which is equivalent to imposing a Laplace prior on C in the maximum-likelihood framework (Banerjee, El Ghaoui, & d’Aspremont 2008; Friedman, Hastie, & Tibshirani 2007; Yuan & Lin 2007; Duchi, Gould, & Koller 2008). Formally, the entries $C_{ij}$ of the inverse covariance matrix are assumed to be independent random variables, each following a Laplace distribution

$$p(C_{ij}) = \frac{\lambda_{ij}}{2} e^{-\lambda_{ij}|C_{ij} - \alpha_{ij}|}$$

with zero location parameter (mean) $\alpha_{ij} = 0$ and common scale parameter $\lambda_{ij} = \lambda$, yielding

$$p(C) = \prod_{i=1}^p \prod_{j=1}^p p(C_{ij}) = (\lambda/2)^{p^2} e^{-\lambda \|C\|_1},$$

where $\|C\|_1 = \sum_{ij} |C_{ij}|$ is the (vector) l1-norm of C. The objective is then to find the maximum a posteriori solution $\arg\max_{C \succ 0} p(C|X)$, where X is the n × p data matrix, or equivalently, since $p(C|X) = p(X, C)/p(X)$ and $p(X)$ does not depend on C, to find $\arg\max_{C \succ 0} p(X, C)$ over positive definite matrices C. This yields the following optimization problem considered in (Banerjee, El Ghaoui, & d’Aspremont 2008; Friedman, Hastie, & Tibshirani 2007; Yuan & Lin 2007; Duchi, Gould, & Koller 2008):

$$\max_{C \succ 0} \; \ln \det(C) - \mathrm{tr}(AC) - \lambda \|C\|_1 \qquad (3)$$

where det(Z) and tr(Z) denote the determinant and the trace (sum of the diagonal elements) of a matrix Z, respectively.

Herein, we make a more general assumption about p(C), allowing different rows of C to have different parameters $\lambda_i$, i.e.,

$$p(C_{ij}) = \frac{\lambda_i}{2} e^{-\lambda_i |C_{ij}|}.$$

This reflects our desire to model structured networks with potentially very different node degrees (i.e., row densities in C), and yields

$$p(C) = \prod_{i=1}^p \prod_{j=1}^p \frac{\lambda_i}{2} e^{-\lambda_i |C_{ij}|} = \prod_{i=1}^p \frac{\lambda_i^p}{2^p} e^{-\lambda_i \sum_{j=1}^p |C_{ij}|}.$$

Moreover, we take a Bayesian approach and assume that the parameters $\lambda_i$ are themselves random variables following some joint distribution $p(\{\lambda_i\})$. Given a dataset X of n samples (rows) of the vector X, the joint log-likelihood can then be written as

$$\ln L(X, C, \{\lambda_i\}) = \ln\{\, p(X|C)\, p(C|\{\lambda_i\})\, p(\{\lambda_i\}) \,\} = \ln p(X|C) + \ln p(C|\{\lambda_i\}) + \ln p(\{\lambda_i\}).$$
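The alternating optimization over Θ = C and Λ = {λi} can be illustrated, in the scalar-λ special case, as follows. This is a minimal sketch, not the paper's algorithm: it assumes scikit-learn's GraphicalLasso as the solver for the C-step of eq. (3) and a flat prior p(λ) = const, under which the λ-step has the closed form λ = p²/||C||₁; the paper's actual prior on λ and solver details are not specified in this excerpt.

```python
# Alternating MAP sketch for the scalar-lambda case, under assumptions stated
# in the surrounding text (flat prior on lambda, GraphicalLasso for the C-step).
import numpy as np
from sklearn.covariance import GraphicalLasso

def alternating_map(X, lam_init=0.1, n_outer=10, tol=1e-3):
    n, p = X.shape
    lam = lam_init
    C = None
    for _ in range(n_outer):
        # C-step: l1-penalized Gaussian log-likelihood, as in eq. (3), with
        # lambda held fixed. Note that sklearn's alpha penalizes only the
        # off-diagonal entries, a slight departure from ||C||_1 as defined above.
        gl = GraphicalLasso(alpha=lam, max_iter=200).fit(X)
        C = gl.precision_
        # lambda-step: with a flat prior on lambda, maximizing
        # p^2 * ln(lambda) - lambda * ||C||_1 gives lambda = p^2 / ||C||_1.
        lam_new = p ** 2 / np.abs(C).sum()
        if abs(lam_new - lam) < tol * lam:
            lam = lam_new
            break
        lam = lam_new
    return C, lam

# Usage on a standardized n-by-p data matrix X:
#   C_hat, lam_hat = alternating_map(X)
```

For the vector-λ formulation, the C-step would require a solver that accepts a per-row (or per-entry) penalty, and under the same flat-prior assumption the λ-step becomes λi = p / Σj |Cij| for each row i.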